Recurrent Neural Networks

RNN

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior.
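As a rough illustration of the "internal state" described above, here is a minimal NumPy sketch of a tanh RNN unrolled over one sequence. The function name, the weight names (W, U, b) and the toy sizes are assumptions made for this example; this is not the Keras implementation.

```python
import numpy as np

def simple_rnn_forward(x, W, U, b):
    """Run a tanh RNN over one sequence and return every hidden state.

    x: (timesteps, input_dim), W: (input_dim, units),
    U: (units, units), b: (units,)
    """
    timesteps = x.shape[0]
    units = U.shape[0]
    h = np.zeros(units)                      # initial state
    states = np.zeros((timesteps, units))
    for t in range(timesteps):
        # The new state depends on the current input AND the previous state
        h = np.tanh(x[t] @ W + h @ U + b)
        states[t] = h
    return states

rng = np.random.RandomState(0)
x = rng.randn(5, 3)                          # 5 time steps, 3 features
W = rng.randn(3, 4) * 0.1
U = rng.randn(4, 4) * 0.1
b = np.zeros(4)
states = simple_rnn_forward(x, W, U, b)
print(states.shape)
```

The last row of states is what a recurrent layer would hand to the next layer when only the final output is requested.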

keras.layers.recurrent.SimpleRNN(units, activation='tanh', use_bias=True, 
                                 kernel_initializer='glorot_uniform', 
                                 recurrent_initializer='orthogonal', 
                                 bias_initializer='zeros', 
                                 kernel_regularizer=None, 
                                 recurrent_regularizer=None, 
                                 bias_regularizer=None, 
                                 activity_regularizer=None, 
                                 kernel_constraint=None, recurrent_constraint=None, 
                                 bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)

Arguments:

units: Positive integer, dimensionality of the output space.
activation: Activation function to use (see activations). If you pass None, no activation is applied (i.e. "linear" activation: a(x) = x).
use_bias: Boolean, whether the layer uses a bias vector.
kernel_initializer: Initializer for the kernel weights matrix, used for the linear transformation of the inputs. (see initializers).
recurrent_initializer: Initializer for the recurrent_kernel weights matrix, used for the linear transformation of the recurrent state. (see initializers).
bias_initializer: Initializer for the bias vector (see initializers).
kernel_regularizer: Regularizer function applied to the kernel weights matrix (see regularizer).
recurrent_regularizer: Regularizer function applied to the recurrent_kernel weights matrix (see regularizer).
bias_regularizer: Regularizer function applied to the bias vector (see regularizer).
activity_regularizer: Regularizer function applied to the output of the layer (its "activation"). (see regularizer).
kernel_constraint: Constraint function applied to the kernel weights matrix (see constraints).
recurrent_constraint: Constraint function applied to the recurrent_kernel weights matrix (see constraints).
bias_constraint: Constraint function applied to the bias vector (see constraints).
dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
recurrent_dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.

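Backpropagation Through Time

Unlike feed-forward neural networks, an RNN can encode information from far back in the input sequence, which makes it well suited to sequence modelling. Backpropagation through time (BPTT) extends the ordinary backpropagation (BP) algorithm to the recurrent architecture: the network is unrolled over the time steps and gradients are propagated backwards through the unrolled graph.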

Reference: Backpropagation through Time
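To make the idea concrete, the following is a small sketch of BPTT for a tanh RNN, checked against a finite-difference gradient. The loss (half the squared norm of the final state), the function names and the toy sizes are assumptions chosen for this illustration; Keras computes these gradients automatically through its backend.

```python
import numpy as np

def forward(x, W, U, b):
    """Return the list of hidden states [h_0, h_1, ..., h_T]."""
    h = np.zeros(U.shape[0])
    hs = [h]
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W + h @ U + b)
        hs.append(h)
    return hs

def loss_fn(x, W, U, b):
    # Toy loss on the final state only
    return 0.5 * np.sum(forward(x, W, U, b)[-1] ** 2)

def bptt(x, W, U, b):
    """Gradients of loss_fn w.r.t. W, U and b via backprop through time."""
    hs = forward(x, W, U, b)
    dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
    dh = hs[-1].copy()                       # dL/dh_T
    for t in reversed(range(x.shape[0])):
        da = dh * (1.0 - hs[t + 1] ** 2)     # through tanh at step t
        dW += np.outer(x[t], da)
        dU += np.outer(hs[t], da)            # hs[t] is the previous state
        db += da
        dh = da @ U.T                        # pass gradient to the previous step
    return dW, dU, db

rng = np.random.RandomState(1)
x = rng.randn(6, 3)
W = rng.randn(3, 4) * 0.5
U = rng.randn(4, 4) * 0.5
b = rng.randn(4) * 0.1
dW, dU, db = bptt(x, W, U, b)

# Central finite differences on U as a sanity check
eps = 1e-6
num_dU = np.zeros_like(U)
for i in range(U.shape[0]):
    for j in range(U.shape[1]):
        Up, Um = U.copy(), U.copy()
        Up[i, j] += eps
        Um[i, j] -= eps
        num_dU[i, j] = (loss_fn(x, W, Up, b) - loss_fn(x, W, Um, b)) / (2 * eps)
max_err = np.max(np.abs(dU - num_dU))
print(max_err < 1e-5)
```

Note how the same matrix U receives a gradient contribution from every time step; this repeated multiplication by U is also why gradients can vanish or explode over long sequences.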



In [1]:

    
%matplotlib inline



In [3]:

    
import numpy as np
import pandas as pd
#import theano
#import theano.tensor as T
import keras 

import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# -- Keras Import
from keras.models import Sequential
from keras.preprocessing import image

from keras.datasets import imdb
from keras.datasets import mnist

from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D

from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU, SimpleRNN

from keras.layers import TimeDistributed, RepeatVector
from keras.callbacks import EarlyStopping, ModelCheckpoint

Using TensorFlow backend.

IMDB sentiment classification task

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.

The dataset provides 25,000 highly polar movie reviews for training and 25,000 for testing.

There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.

http://ai.stanford.edu/~amaas/data/sentiment/

Data Preparation - IMDB



In [4]:

    
max_features = 20000
maxlen = 100  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print("Loading data...")
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print('Example:')
print(X_train[:1])

print("Pad sequences (samples x time)")
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Loading data...
Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
17465344/17464789 [==============================] - 14s 1us/step
25000 train sequences
25000 test sequences
Example:
[ list([1, 5, 13, 384, 3282, 641, 14, 22, 944, 689, 1463, 381, 44, 289, 5, 6, 320, 640, 5, 164, 724, 15, 10, 10, 50, 66, 218, 99, 76, 8, 135, 44, 14, 4216, 85, 74, 15, 261, 12, 47, 6, 378, 7, 66, 52, 1801, 91, 7, 12, 218, 55, 163, 885, 127, 12, 157, 33, 32, 17, 6, 883, 89, 44, 17, 6, 731, 212, 24, 23, 129, 113, 91, 7, 4, 414, 9, 96, 99, 1035, 8, 30, 3495, 76, 329, 1139, 10, 10, 803, 66, 2, 9, 4, 863, 9, 24, 78, 33, 32, 14, 20, 100, 28, 77, 38, 76, 53, 262, 19, 32, 4, 1136, 1152, 23, 49, 7, 4, 6514, 771, 11, 63, 108, 26, 6525, 601, 19, 4980, 4414, 39, 1418, 4, 22, 13723, 4, 2759, 2, 3641, 7, 4, 3534, 2, 148, 32, 6898, 1539, 10558, 18, 11834, 37, 220, 210, 901, 327, 857, 21, 305, 7, 5885, 51, 144, 28, 77, 6, 2, 7, 640, 12, 2, 32, 7, 44, 289, 234, 8, 14, 3641, 5, 1102, 23, 11, 4856, 7, 4, 598, 835, 883, 10, 10, 4, 172, 19, 3342, 8724, 8906, 109, 4, 2, 532, 4012, 20, 323, 8724, 1532, 127, 6, 52, 292, 19, 51, 442, 348, 21, 442, 348, 2201, 164, 45, 32, 2, 2582, 15, 272, 55, 6521, 11, 2096, 19, 49, 7, 4, 183, 2002, 557, 44, 381, 120, 4, 153, 10, 10, 11, 4, 130, 12, 9, 254, 8, 391, 51, 93, 8724, 1532, 2686, 3338, 308, 3843, 5, 1514, 5257, 1913, 8860, 14, 4162, 1693, 63, 8284, 40, 6, 14307, 7, 4, 119, 2074, 11, 192, 17, 4, 154, 975, 271, 36, 144, 28, 1551, 4, 229, 5, 814, 4, 855, 12, 62, 242, 97, 6, 128, 65, 38, 140, 1404, 5, 376, 178, 1057, 51, 81, 25, 28, 23, 134, 381, 15, 188, 98, 8, 977, 11, 14])]
Pad sequences (samples x time)
X_train shape: (25000, 100)
X_test shape: (25000, 100)
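The padding step above turns reviews of different lengths into a fixed (samples x time) array. The sketch below mimics what pad_sequences does with its default settings (zeros added at the front, long sequences cut from the front); pad_pre and the toy sequences are hypothetical names and data used only for illustration.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for i, s in enumerate(seqs):
        if len(s) == 0:
            continue                      # nothing to copy, row stays all padding
        trunc = s[-maxlen:]               # keep the end of the sequence
        out[i, -len(trunc):] = trunc      # right-align inside the row
    return out

toy = [[1, 5, 13], [7, 8, 9, 10, 11, 12], [4]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

With maxlen=100, a long review such as the example printed above keeps only its last 100 word indices.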

Model building



In [7]:

    
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(SimpleRNN(128))  
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam')

print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, epochs=1, 
          validation_data=(X_test, y_test))

Build model...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/1
25000/25000 [==============================] - 104s - loss: 0.7329 - val_loss: 0.6832






    Out[7]:





<keras.callbacks.History at 0x138767780>
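To get a feel for the size of the model trained above, its parameters can be counted by hand. This is a back-of-the-envelope sketch that assumes the standard parameterization of each layer (one embedding row per word index, an input kernel plus a recurrent kernel plus a bias for the SimpleRNN, and a kernel plus a bias for the Dense layer).

```python
max_features, embed_dim, units = 20000, 128, 128

embedding = max_features * embed_dim            # one 128-d vector per word index
simple_rnn = units * (embed_dim + units + 1)    # kernel + recurrent kernel + bias
dense = units * 1 + 1                           # weights + bias of the output unit
total = embedding + simple_rnn + dense

print(embedding, simple_rnn, dense, total)
```

Almost all of the weights sit in the embedding table, so changing the recurrent layer has only a small effect on the total parameter count.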

LSTM

An LSTM network is an artificial neural network that contains LSTM blocks instead of, or in addition to, regular network units. An LSTM block may be described as a "smart" network unit that can remember a value for an arbitrary length of time.

Unlike traditional RNNs, a long short-term memory (LSTM) network is well suited to learning to classify, process and predict time series when there are very long time lags of unknown size between important events.
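The "remember a value" behaviour can be sketched with a single LSTM step in NumPy. The gate ordering inside the stacked weight matrices (input, forget, candidate, output), the use of a plain sigmoid, and all names below are assumptions made for this illustration rather than the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:units])                  # input gate
    f = sigmoid(z[units:2 * units])         # forget gate
    g = np.tanh(z[2 * units:3 * units])     # candidate cell value
    o = sigmoid(z[3 * units:])              # output gate
    c = f * c_prev + i * g                  # cell state: keep old, add new
    h = o * np.tanh(c)
    return h, c

# With the forget gate held open and the input gate held closed,
# the cell state survives many steps almost unchanged.
units, input_dim = 3, 2
W = np.zeros((input_dim, 4 * units))
U = np.zeros((units, 4 * units))
b = np.zeros(4 * units)
b[:units] = -20.0                # input gate close to 0
b[units:2 * units] = 20.0        # forget gate close to 1
c0 = np.array([0.3, -0.7, 1.5])
h, c = np.zeros(units), c0.copy()
rng = np.random.RandomState(0)
for _ in range(50):
    h, c = lstm_step(rng.randn(input_dim), h, c, W, U, b)
print(np.allclose(c, c0, atol=1e-6))
```

In a trained network the gates are of course learned functions of the input and the previous state, not fixed biases as in this toy setup.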

keras.layers.recurrent.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
                            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
                            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, 
                            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, 
                            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, 
                            dropout=0.0, recurrent_dropout=0.0)

Arguments

units: Positive integer, dimensionality of the output space.
activation: Activation function to use. If you pass None, no activation is applied (i.e. "linear" activation: a(x) = x).
recurrent_activation: Activation function to use for the recurrent step.
use_bias: Boolean, whether the layer uses a bias vector.
kernel_initializer: Initializer for the kernel weights matrix, used for the linear transformation of the inputs.
recurrent_initializer: Initializer for the recurrent_kernel weights matrix, used for the linear transformation of the recurrent state.
bias_initializer: Initializer for the bias vector.
unit_forget_bias: Boolean. If True, add 1 to the bias of the forget gate at initialization. Setting it to true will also force bias_initializer="zeros". This is recommended in Jozefowicz et al.
kernel_regularizer: Regularizer function applied to the kernel weights matrix.
recurrent_regularizer: Regularizer function applied to the recurrent_kernel weights matrix.
bias_regularizer: Regularizer function applied to the bias vector.
activity_regularizer: Regularizer function applied to the output of the layer (its "activation").
kernel_constraint: Constraint function applied to the kernel weights matrix.
recurrent_constraint: Constraint function applied to the recurrent_kernel weights matrix.
bias_constraint: Constraint function applied to the bias vector.
dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
recurrent_dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
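Two of the arguments above, recurrent_activation='hard_sigmoid' and unit_forget_bias, are easier to read with numbers. The sketch below assumes the piecewise-linear definition of hard_sigmoid used by Keras (0.2 * x + 0.5, clipped to the range 0 to 1) and shows the effect of a forget-gate bias of 1 at initialization; the function is re-implemented in NumPy for illustration only.

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid, clipped to [0, 1]."""
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

xs = np.array([-3.0, 0.0, 1.0, 3.0])
values = hard_sigmoid(xs)

# Forget gate value for a zero pre-activation, without and with the +1 bias
forget_plain = hard_sigmoid(0.0)
forget_biased = hard_sigmoid(0.0 + 1.0)
print(values, forget_plain, forget_biased)
```

Starting with a forget gate above 0.5 means the cell keeps more of its previous state early in training, which is the motivation behind unit_forget_bias=True.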

GRU

Gated recurrent units are a gating mechanism in recurrent neural networks.

GRUs are similar to LSTMs but have fewer parameters, as they lack an output gate.
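A minimal NumPy sketch of one GRU step, together with a parameter count, shows where the saving comes from. It assumes the classic formulation with an update gate, a reset gate and a candidate state (three weight blocks against the four of an LSTM); the dictionary of weights and the helper names are hypothetical, and some GRU variants carry additional bias terms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p holds the W*, U*, b* arrays for gates z, r and candidate h."""
    z = sigmoid(x_t @ p['Wz'] + h_prev @ p['Uz'] + p['bz'])      # update gate
    r = sigmoid(x_t @ p['Wr'] + h_prev @ p['Ur'] + p['br'])      # reset gate
    h_cand = np.tanh(x_t @ p['Wh'] + (r * h_prev) @ p['Uh'] + p['bh'])
    return z * h_prev + (1.0 - z) * h_cand                       # no output gate

def recurrent_params(input_dim, units, n_blocks):
    """Weights of n_blocks blocks, each with kernel, recurrent kernel and bias."""
    return n_blocks * units * (input_dim + units + 1)

rng = np.random.RandomState(0)
input_dim, units = 3, 4
p = {}
for g in 'zrh':
    p['W' + g] = rng.randn(input_dim, units) * 0.1
    p['U' + g] = rng.randn(units, units) * 0.1
    p['b' + g] = np.zeros(units)
h = gru_step(rng.randn(input_dim), np.zeros(units), p)

gru_count = recurrent_params(128, 128, n_blocks=3)
lstm_count = recurrent_params(128, 128, n_blocks=4)
print(h.shape, gru_count, lstm_count)
```

For a 128-dimensional input and 128 units, the GRU needs three quarters of the recurrent weights of the LSTM under these assumptions.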

keras.layers.recurrent.GRU(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
                           kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
                           bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, 
                           bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, 
                           recurrent_constraint=None, bias_constraint=None, 
                           dropout=0.0, recurrent_dropout=0.0)

Your Turn! - Hands on RNN



In [ ]:

    
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))

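# !!! Play with those! Try and get better results!
# Uncomment exactly one of the recurrent layers below before running the cell;
# without one, the output is still a sequence and no longer matches the
# per-review labels.
#model.add(SimpleRNN(128))  
#model.add(GRU(128))  
#model.add(LSTM(128))  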

model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

# try using different optimizers and different optimizer configs
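# metrics=['accuracy'] is needed so that evaluate() returns (score, acc) below
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])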

print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, 
          epochs=4, validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Convolutional LSTM

This section demonstrates the use of a Convolutional LSTM network.

This network is used to predict the next frame of an artificially generated movie which contains moving squares.

Artificial Data Generation

Generate movies with 3 to 7 moving squares inside.

The squares have a side of 4 or 6 pixels (the code draws a block of half-width $w \in \{2, 3\}$) and move linearly over time.

For convenience we first create movies with bigger width and height (80x80) and at the end we select a $40 \times 40$ window.
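The drawing and cropping steps can be tried on a single frame first. This small sketch uses the same slicing pattern as the generator below (a block from position - w to position + w, then a 20:60 window); the position and half-width are arbitrary values picked for illustration.

```python
import numpy as np

frame = np.zeros((80, 80))
x, y, w = 40, 40, 2                       # centre and half-width of one square
frame[x - w: x + w, y - w: y + w] += 1    # paints a block of 2*w by 2*w pixels

window = frame[20:60, 20:60]              # central 40x40 crop
print(window.shape, int(window.sum()))
```

Working on the larger 80x80 canvas means a square that drifts out of the final window is simply cropped away, with no need for boundary checks while drawing.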



In [1]:

    
# Artificial Data Generation
def generate_movies(n_samples=1200, n_frames=15):
    row = 80
    col = 80
    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    noisy_movies = np.zeros((n_samples, n_frames, row, col, 1), dtype=float)
    shifted_movies = np.zeros((n_samples, n_frames, row, col, 1),
                              dtype=float)

    for i in range(n_samples):
        # Add 3 to 7 moving squares
        n = np.random.randint(3, 8)

        for j in range(n):
            # Initial position
            xstart = np.random.randint(20, 60)
            ystart = np.random.randint(20, 60)
            # Direction of motion
            directionx = np.random.randint(0, 3) - 1
            directiony = np.random.randint(0, 3) - 1

            # Size of the square
            w = np.random.randint(2, 4)

            for t in range(n_frames):
                x_shift = xstart + directionx * t
                y_shift = ystart + directiony * t
                noisy_movies[i, t, x_shift - w: x_shift + w,
                             y_shift - w: y_shift + w, 0] += 1

                # Make it more robust by adding noise.
                # The idea is that if during inference,
                # the value of the pixel is not exactly one,
                # we need to train the network to be robust and still
                # consider it as a pixel belonging to a square.
                if np.random.randint(0, 2):
                    noise_f = (-1)**np.random.randint(0, 2)
                    noisy_movies[i, t,
                                 x_shift - w - 1: x_shift + w + 1,
                                 y_shift - w - 1: y_shift + w + 1,
                                 0] += noise_f * 0.1

                # Shift the ground truth by 1
                x_shift = xstart + directionx * (t + 1)
                y_shift = ystart + directiony * (t + 1)
                shifted_movies[i, t, x_shift - w: x_shift + w,
                               y_shift - w: y_shift + w, 0] += 1

    # Cut to a 40x40 window
    noisy_movies = noisy_movies[::, ::, 20:60, 20:60, ::]
    shifted_movies = shifted_movies[::, ::, 20:60, 20:60, ::]
    noisy_movies[noisy_movies >= 1] = 1
    shifted_movies[shifted_movies >= 1] = 1
    return noisy_movies, shifted_movies

Model



In [2]:

    
from keras.models import Sequential
from keras.layers.convolutional import Conv3D
from keras.layers.convolutional_recurrent import ConvLSTM2D
from keras.layers.normalization import BatchNormalization
import numpy as np
from matplotlib import pyplot as plt

%matplotlib inline

Using TensorFlow backend.

We create a model which takes as input movies of shape (n_frames, width, height, channels) and returns a movie of identical shape.



In [3]:

    
seq = Sequential()
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   input_shape=(None, 40, 40, 1),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(Conv3D(filters=1, kernel_size=(3, 3, 3),
               activation='sigmoid',
               padding='same', data_format='channels_last'))
seq.compile(loss='binary_crossentropy', optimizer='adadelta')
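The layer sizes of the model above can be estimated by hand. The sketch below assumes the usual ConvLSTM parameterization (four gates, each with an input kernel, a recurrent kernel and a bias), four values per channel for batch normalization, and a biased Conv3D; the helper name is hypothetical.

```python
def convlstm2d_params(in_channels, filters, kernel=(3, 3)):
    """4 gates x filters x (input kernel + recurrent kernel + bias)."""
    kh, kw = kernel
    return 4 * filters * (kh * kw * (in_channels + filters) + 1)

first = convlstm2d_params(in_channels=1, filters=40)    # layer fed by the raw movie
later = convlstm2d_params(in_channels=40, filters=40)   # each of the next three layers
batch_norm = 4 * 40                                     # gamma, beta, moving mean, moving variance
conv3d = 3 * 3 * 3 * 40 * 1 + 1                         # one output filter plus bias

total = first + 3 * later + 4 * batch_norm + conv3d
print(first, later, conv3d, total)
```

Because the kernels are shared across all spatial positions and time steps, the count does not depend on the 40x40 frame size or on the number of frames.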

Train the Network

Beware: This takes time (~3 mins per epoch on my hardware)



In [4]:

    
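# Train the network
# Note: the log below comes from an earlier run with epochs=50 that was
# interrupted by hand during epoch 21.
noisy_movies, shifted_movies = generate_movies(n_samples=1200)
seq.fit(noisy_movies[:1000], shifted_movies[:1000], batch_size=10,
        epochs=20, validation_split=0.05)
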
Train on 950 samples, validate on 50 samples
Epoch 1/50
950/950 [==============================] - 180s - loss: 0.3293 - val_loss: 0.6113
Epoch 2/50
950/950 [==============================] - 181s - loss: 0.0629 - val_loss: 0.4206
Epoch 3/50
950/950 [==============================] - 180s - loss: 0.0187 - val_loss: 0.2585
Epoch 4/50
950/950 [==============================] - 180s - loss: 0.0062 - val_loss: 0.2087
Epoch 5/50
950/950 [==============================] - 179s - loss: 0.0134 - val_loss: 0.1884
Epoch 6/50
950/950 [==============================] - 180s - loss: 0.0024 - val_loss: 0.1025
Epoch 7/50
950/950 [==============================] - 179s - loss: 0.0013 - val_loss: 0.0079
Epoch 8/50
950/950 [==============================] - 180s - loss: 8.1664e-04 - val_loss: 7.7649e-04
Epoch 9/50
950/950 [==============================] - 180s - loss: 5.9629e-04 - val_loss: 4.9810e-04
Epoch 10/50
950/950 [==============================] - 180s - loss: 4.8772e-04 - val_loss: 4.5704e-04
Epoch 11/50
950/950 [==============================] - 179s - loss: 4.1252e-04 - val_loss: 3.7326e-04
Epoch 12/50
950/950 [==============================] - 180s - loss: 3.6413e-04 - val_loss: 3.3256e-04
Epoch 13/50
950/950 [==============================] - 179s - loss: 3.2918e-04 - val_loss: 2.8421e-04
Epoch 14/50
950/950 [==============================] - 179s - loss: 2.9520e-04 - val_loss: 2.8827e-04
Epoch 15/50
950/950 [==============================] - 179s - loss: 2.7647e-04 - val_loss: 2.5144e-04
Epoch 16/50
950/950 [==============================] - 181s - loss: 2.5863e-04 - val_loss: 2.5015e-04
Epoch 17/50
950/950 [==============================] - 180s - loss: 2.4067e-04 - val_loss: 2.2645e-04
Epoch 18/50
950/950 [==============================] - 180s - loss: 2.2378e-04 - val_loss: 2.1206e-04
Epoch 19/50
950/950 [==============================] - 179s - loss: 2.1416e-04 - val_loss: 2.0406e-04
Epoch 20/50
950/950 [==============================] - 179s - loss: 2.0244e-04 - val_loss: 1.9820e-04
Epoch 21/50
 20/950 [..............................] - ETA: 170s - loss: 1.8054e-04





    



---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-4-5547645715ec> in <module>()
      2 noisy_movies, shifted_movies = generate_movies(n_samples=1200)
      3 seq.fit(noisy_movies[:1000], shifted_movies[:1000], batch_size=10,
----> 4         epochs=50, validation_split=0.05)

KeyboardInterrupt:

Test the Network



In [5]:

    
# Testing the network on one movie
# feed it with the first 7 positions and then
# predict the new positions
which = 1004
track = noisy_movies[which][:7, ::, ::, ::]

for j in range(16):
    new_pos = seq.predict(track[np.newaxis, ::, ::, ::, ::])
    new = new_pos[::, -1, ::, ::, ::]
    track = np.concatenate((track, new), axis=0)
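The loop above is an autoregressive rollout: each predicted frame is appended to the track and fed back in for the next prediction. The sketch below reproduces only that bookkeeping, with a hypothetical stand-in (fake_predict) in place of seq.predict so that it runs without a trained model.

```python
import numpy as np

def fake_predict(batch):
    """Hypothetical stand-in for seq.predict: shifts every frame one row down."""
    return np.roll(batch, shift=1, axis=2)

# Seed track: 7 frames with a single pixel moving down one row per frame
track = np.zeros((7, 40, 40, 1))
for t in range(7):
    track[t, 10 + t, 20, 0] = 1.0

for _ in range(16):
    pred = fake_predict(track[np.newaxis])     # (1, n_frames, 40, 40, 1)
    next_frame = pred[:, -1]                   # prediction for the newest frame
    track = np.concatenate((track, next_frame), axis=0)

print(track.shape)
```

The track grows from 7 to 23 frames, and every frame after the seventh is built purely from earlier predictions, so errors made by a real model accumulate over the rollout.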



In [6]:

    
# And then compare the predictions
# to the ground truth
track2 = noisy_movies[which][::, ::, ::, ::]
for i in range(15):
    fig = plt.figure(figsize=(10, 5))

    ax = fig.add_subplot(121)

    if i >= 7:
        ax.text(1, 3, 'Predictions !', fontsize=20, color='w')
    else:
        ax.text(1, 3, 'Initial trajectory', fontsize=20)

    toplot = track[i, ::, ::, 0]

    plt.imshow(toplot)
    ax = fig.add_subplot(122)
    plt.text(1, 3, 'Ground truth', fontsize=20)

    toplot = track2[i, ::, ::, 0]
    if i >= 2:
        toplot = shifted_movies[which][i - 1, ::, ::, 0]

    plt.imshow(toplot)
    plt.savefig('imgs/convlstm/%i_animate.png' % (i + 1))